Abstract

The dictionary matching problem is to locate occurrences of any pattern among a set of patterns in
a given text. Massive data sets abound and, at the same time, there are many settings in which working
space is extremely limited. We introduce dictionary matching software for the space-constrained
environment whose running time is close to linear. We use the compressed suffix tree as the underlying
data structure of our algorithm; thus, the working space of our algorithm is proportional to the optimal
compression of the dictionary. We also contribute a succinct tool for performing constant-time lowest
marked ancestor queries on a tree that is succinctly encoded as a sequence of balanced parentheses, with
linear-time preprocessing of the tree. This tool should be useful in many other applications.
Our source code is available at http://www.sci.brooklyn.cuny.edu/~sokol/dictmatch.html

1 Introduction 

In recent years, there has been a massive proliferation of digital data. Concurrently, industry has been
producing devices with ever smaller hardware footprints. Thus, we are faced with scenarios in which
this growing body of data must be accessible to applications running on devices that have reduced
storage capacity, such as mobile and satellite devices. Hardware resources are more limited, yet the
user's expectations of software capability continue to escalate. This unprecedented rate of digital data
accumulation therefore presents a constant challenge to the algorithm and software developers who must
work with shrinking hardware capacity.

The dictionary matching problem is to identify all occurrences of a set of patterns, called a dictionary, within a given text.
Applications for this problem include searching for specific phrases in a book, scanning a file for virus 
signatures, and network intrusion detection. The problem also has applications in the biological sciences, 
such as searching through a DNA sequence for a set of motifs, identifying motifs to characterize protein 
families, and finding anchors for fast alignment of large genomic sequences. 

A series of dictionary matching algorithms that operate in small space have in fact been developed
[5, 11, 12, 13]. The latest of these results [11] achieved time- and space-optimal 1D dictionary matching.
That is, their algorithm runs in linear time within space that meets empirical entropy bounds of the
dictionary. The empirical entropy of a string ($H_0$ or $H_k$) describes the minimum number of bits that
are needed to encode the string within context.
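
To make the entropy bound concrete, the following minimal sketch computes the zeroth-order empirical entropy $H_0$ of a string using the standard definition $H_0(S) = -\sum_c (n_c/n) \log_2 (n_c/n)$, where $n_c$ is the number of occurrences of character $c$ in a string of length $n$. The function name `h0` is an assumption made for this illustration and is not part of the software described here.

```python
from collections import Counter
from math import log2


def h0(s):
    """Zeroth-order empirical entropy of s, in bits per symbol (s must be non-empty)."""
    n = len(s)
    # Each distinct character contributes -(n_c / n) * log2(n_c / n).
    return -sum((count / n) * log2(count / n) for count in Counter(s).values())


if __name__ == "__main__":
    # Two equally frequent symbols need one bit per symbol on average.
    print(h0("abab"))
```

Multiplying `h0(s)` by `len(s)` gives the number of bits a zeroth-order compressor would need for the whole string; $H_k$ refines this by conditioning each character on the $k$ characters that precede it.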
Succinct dictionary matching algorithms have remained in the theoretical realm and have not been
implemented until now. This work fills the void. We have developed software for dictionary matching in small
space that relies on the compressed suffix tree, a popular succinct data structure. Our main challenge lay
in combining a dictionary matching algorithm for the generalized suffix tree with compressed suffix tree
representations.

Chan et al. developed the first succinct dictionary matching algorithm. Their algorithm uses the
compressed suffix tree of Sadakane [21], which they extended so that it can support a dynamically changing
dictionary of patterns. Hon et al. presented a more space-efficient dictionary matching algorithm that uses
a sampling technique to compress a suffix tree [14]. Their algorithm uses several data structures along
with a compressed representation of the suffix tree, among them the string B-tree, a compressed trie of the
patterns, and an LCP array of the longest common prefixes between the patterns. Our software uses only the
compressed suffix tree, augmented with a succinct framework for lowest marked ancestor queries.

We also contribute a succinct tool for performing lowest marked ancestor queries in constant time after
linear-time preprocessing of a compressed suffix tree. The compressed suffix tree is augmented by a bit
vector and a sequence of balanced parentheses. Lowest marked ancestor queries are answered by a set of
constant-time queries to these data structures. The lowest marked ancestor structure that we implemented
is appropriate for any succinct representation of an ordered tree that encodes the structure as a sequence of
balanced parentheses, a representation introduced by Jacobson [15]. Thus, our tool for lowest marked ancestor
queries is a contribution that is useful to other applications that use the balanced parentheses representation
of a tree.

We begin with an overview of compressed suffix trees in Section 2 since they are the basis of our
succinct dictionary matching software. In Section 3 we describe our linear-time dictionary matching software
that relies on the uncompressed suffix tree. Then, in Section 4 we describe the techniques employed by
our succinct dictionary matching software, and the succinct framework we implemented for lowest marked
ancestor queries on a compressed suffix tree. In Section 5 we present experimental results to demonstrate
that these techniques are in fact space-efficient. We conclude with a summary and direction for future work
in Section 6.

2 Compressed Suffix Tree 

We begin with a description of the suffix tree and of compressed representations of the suffix tree since our
program relies on the compressed suffix tree as its underlying data structure. The suffix tree is a compact
trie that represents all suffixes of an input string. The suffix tree for $S = s_1 s_2 \cdots s_n$ is a rooted,
directed tree with $n$ leaves, one for each suffix. Each internal node, except the root, has at least two
children. Each edge is labeled with a nonempty substring of $S$ and no two edges emanating from a node begin
with the same character. The path from the root to leaf $i$ spells out suffix $S[i \ldots n]$. Suffix links allow
an algorithm to move quickly to a distant part of the tree. A suffix link is a pointer from an internal node
labeled $x\alpha$ to another internal node labeled $\alpha$, where $x$ is an arbitrary character and $\alpha$ is a
possibly empty substring. The suffix array is a data structure that indexes a string by storing the
lexicographical order of its suffixes.
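
As a small illustration of the suffix array just described, the sketch below sorts suffix start positions naively. The function name `suffix_array` and the `$` terminator are assumptions made for this example; the quadratic-space sort is for exposition only and says nothing about how a compressed suffix array is built.

```python
def suffix_array(s):
    """Return the start positions of the suffixes of s in lexicographic order."""
    # Materializing every suffix costs O(n^2) space; acceptable for a toy input only.
    return sorted(range(len(s)), key=lambda i: s[i:])


if __name__ == "__main__":
    text = "banana$"  # '$' is an assumed terminator that sorts before the letters
    for start in suffix_array(text):
        print(start, text[start:])
```

The leaves of the suffix tree, read left to right, list the suffixes in exactly this order, which is why the suffix array can serve as the leaf level of a compressed suffix tree.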

Recent innovations in succinct full-text indexing provide us with the ability to compress both a suffix
array and a suffix tree, using space that is proportional to the optimal compression of the data they are built
upon. These self-indexes can replace the original text, as they support retrieval of the original text, in
addition to answering queries about the data very quickly. Several compressed suffix array (CSA) representations
exist, e.g., [10, 11, 19], each with a different time-space trade-off. The most recent results meet the $k$th order
empirical entropy of the input string. Compressed representations of the suffix tree, e.g., [21, 18, 9, 16],
use the compressed suffix array as a component. Table 1 summarizes the time-space trade-offs in several
compressed suffix tree (CST) representations.

Space (bits) | Slowdown | Reference
$O(\ell \log \ell)$ | $O(1)$ | Uncompressed suffix tree
$O(\ell \log \sigma)$ | $O(\mathrm{polylog}(\ell))$ | Sadakane [21]
$\ell H_k(T) + o(\ell \log \sigma)$ | $O(\log \ell)$ | Russo et al. [18]
$(1 + \frac{1}{\epsilon}) \ell H_k(T) + o(\ell \log \sigma)$ | $O(\log^{\epsilon} \ell)$, $0 < \epsilon < 1$ | Fischer et al. [9]
$|CSA| + |CLCP| + 3\ell$ | $O(1)$ for many operations | Ohlebusch et al. [16]

Table 1: Compressed suffix tree representations for an input string $T$ of length $\ell$, where $|CSA|$ is the number of bits
used to store the compressed suffix array and $|CLCP|$ is the number of bits occupied by the compressed LCP array.

Ohlebusch et al. [16] recognized that the compressed suffix tree generally consists of three separate
parts: the lexicographical information in a compressed suffix array (CSA), the information about common
substrings in the longest common prefix array (LCP), and the tree topology combined with a navigational
structure (NAV). Each of these three components functions independently from the others and is stored
separately. Representations of compressed suffix arrays and compressed LCP arrays are interchangeable in
many compressed suffix tree representations. Combining the different representations of each component
yields a rich variety of compressed suffix trees, although some compressed suffix trees favor certain
compressed suffix array or compressed LCP array representations. The Succinct Data Structures Library (SDSL)
provides a range of compressed suffix tree implementations, which we used in our dictionary matching
software. We experimented with Sadakane's compressed suffix tree [21] by using an assortment of compressed
suffix array and compressed LCP modules to achieve different time and space complexities in our dictionary
matching software.
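
To show what the first two components hold before any compression, the sketch below builds a plain suffix array and the matching LCP array for a short string. Both helpers (`suffix_array`, `lcp_array`) are naive, hypothetical stand-ins written for this illustration; they are not SDSL classes and make no attempt at succinctness.

```python
def suffix_array(s):
    """Start positions of the suffixes of s in lexicographic order (naive sort)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """lcp[k] is the longest common prefix length of suffixes sa[k-1] and sa[k]; lcp[0] = 0."""
    def common_prefix(i, j):
        length = 0
        while i + length < len(s) and j + length < len(s) and s[i + length] == s[j + length]:
            length += 1
        return length

    return [0] + [common_prefix(sa[k - 1], sa[k]) for k in range(1, len(sa))]


if __name__ == "__main__":
    text = "banana$"
    sa = suffix_array(text)
    print(sa)
    print(lcp_array(text, sa))
```

The third component, the tree topology, can be derived from these two arrays: an internal node corresponds to a maximal run of adjacent suffixes that share a common prefix of a given length.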

3 Linear-Time Dictionary Matching with Suffix Tree

We first developed a linear-time dictionary matching program that uses the uncompressed suffix tree as its 
primary data structure. Then we modified our approach to use the compressed suffix tree to improve the 
space complexity. In this section we describe the linear time dictionary matching software that uses an 
uncompressed suffix tree. Then, in the next section, we delineate the revisions in our techniques so that we 
perform dictionary matching using compressed suffix tree representations. 

A suffix tree can be used to index several strings, in a generalized suffix tree. The dictionary can be
merged to form a single string by concatenating the patterns with a unique delimiter separating them.
Because it is online, Ukkonen's suffix tree construction algorithm can insert one string at a time and index
only the actual suffixes of a set of strings in a suffix tree [12]. The dictionary of patterns is indexed by a
generalized suffix tree to preprocess it for dictionary matching queries. Then, the text is searched for pattern
occurrences in linear time, in a manner similar to Ukkonen's insertion of a new string to the suffix tree. We
briefly summarize Ukkonen's suffix tree construction algorithm in the following paragraph, and depict its
steps in Algorithm 1.
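
The merging of the dictionary into one delimited string can be sketched as follows. This is a minimal illustration under stated assumptions: the delimiter character `$`, the helper names `concatenate` and `pattern_at`, and the choice to terminate every pattern with the delimiter are all hypothetical and are not taken from the software described here.

```python
from bisect import bisect_right

DELIM = "$"  # assumed delimiter; it must not occur inside any pattern


def concatenate(patterns):
    """Join the patterns, ending each with DELIM; also return each pattern's start offset."""
    starts, pieces, offset = [], [], 0
    for pattern in patterns:
        starts.append(offset)
        pieces.append(pattern + DELIM)
        offset += len(pattern) + 1
    return "".join(pieces), starts


def pattern_at(starts, pos):
    """Index of the pattern whose span (delimiter included) covers position pos."""
    return bisect_right(starts, pos) - 1


if __name__ == "__main__":
    merged, starts = concatenate(["a", "ate", "bath", "later"])
    print(merged, starts, pattern_at(starts, 7))
```

Keeping the start offsets makes it possible to translate a position in the merged string back to a pattern number when an occurrence is reported.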

The elegance of Ukkonen's algorithm is evident in its key property. The algorithm admits the arrival of
the string during construction. Yet, each suffix is inserted exactly once, and a leaf is never updated after its
creation. As a new character is appended to the input string, Ukkonen's algorithm ensures that all suffixes
of the string are indexed by the tree. As soon as a suffix is implicitly found in the tree, modification of the
tree ceases until the next new character is examined. The next phase begins by extending the implicit suffix
with the new character. Using suffix links and a pointer to the last suffix inserted, each suffix is inserted in
the tree in amortized constant time. The combination of one-time insertion of each suffix and rapid suffix
insertion results in an overall linear-time suffix tree construction algorithm.

Algorithm 1 Ukkonen's suffix tree construction algorithm

j ← -1 {j is last suffix inserted}
for i = 0 to n - 1 do
    {phase i: i is current end of string}
    while j < i do
        {let j catch up to i}
        if singleExtensionAlgorithm(i, j) then
            break {implicit suffix so proceed to next phase}
        end if
        if lastNodeInserted ≠ root then
            lastNodeInserted.suffixLink ← root
        end if
        lastNodeInserted ← root
    end while
end for

Our dictionary matching algorithm over a generalized suffix tree of patterns was inspired by Ukkonen's
process for inserting a new string into a generalized suffix tree (as shown in Algorithm 1), pretending to
index the text, without modifying the index. The text is processed in an online fashion, traversing the suffix
tree of patterns as each successive character of text is processed. A pattern occurrence is announced when a
labeled leaf is encountered, i.e., a leaf that represents the first suffix of a pattern. At a position of mismatch
and at a pattern occurrence, suffix links are used to navigate to successively smaller suffixes of the matching
string. When a suffix link is used within the label of a node, the corresponding number of characters can
be skipped, obviating redundant character comparisons. In the spirit of Ukkonen's skip-count trick, this
ensures that the text is scanned in linear time.

The skip-count trick is based on Lemma 1. A suffix link is a directed edge from the internal node at the
end of the path labeled $x\alpha$ to another internal node at the end of the path labeled $\alpha$, where $x$ is an
arbitrary character and $\alpha$ is a possibly empty substring. We can similarly define suffix links for leaves in
the tree. The suffix link of the leaf representing suffix $i$ points to the leaf representing suffix $i + 1$.

Lemma 1 [21] In a suffix tree, the number of nodes along the path labeled $\alpha$ is at least as many as the
number of nodes along the path labeled $x\alpha$.

Proof: Suppose not. That is, there is some $\alpha$ for which the path labeled $\alpha$ has fewer nodes than the path
labeled $x\alpha$. This means that some suffix of $x\alpha$ is not indexed by the suffix tree. This implies that the suffix
tree is not fully constructed. Hence, a contradiction and the premise must be valid. ■

Corollary 1 If the suffix link of the root points to itself, every node of the suffix tree has a suffix link.

Ukkonen uses suffix links to navigate across a suffix tree and then skip over the appropriate number of
characters labeling the beginning of an edge. In dictionary matching, we navigate a fully constructed suffix
tree, and every node must have a suffix link established for it. To avoid redundant comparisons, we follow a
suffix link across a suffix tree and then jump up to the position at which the mismatch occurred. It is more
efficient to navigate up a suffix tree than down. That is, every node has a single parent but when navigating
to a child, several branches can be considered. When a suffix link is traversed, we know the number of
characters to skip going up the edge as this is the number of characters that remain along the edge after the
position of mismatch. Yet, we do not know at which node traversal will halt. The shorter label may be split
over more edges than the longer label spans, by Lemma 1.

We extended Algorithm 1 to perform dictionary matching. Pseudocode of our program is delineated in
Algorithm 2, with its submodules extracted to Algorithms 3 and 4. Our program reports the longest pattern
occurrence that ends at each text position. When a pattern is a suffix of a longer pattern, and the longer
pattern occurs in the text, we do not spend time reporting an occurrence of the shorter pattern.

A key challenge in implementing dictionary matching on the suffix tree is the scenario in which one
pattern is a proper substring of another pattern [1]. When the suffix tree is traversed using suffix links (as in
Algorithm 2), these pattern occurrences can pass unnoticed in the text. This limitation is addressed by
augmenting each node of the suffix tree with a pointer to the longest prefix of the label along its path from
the root that is a complete pattern. The nodes are marked with this information in linear time by a depth-first
traversal of the suffix tree.
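
The linear-time marking step can be sketched on a generic rooted tree. In the sketch below, the `children` adjacency lists and the `marked` flags are hypothetical inputs standing in for suffix tree nodes, where a node is marked when its path label is a complete pattern; the function name is also an assumption made for this illustration.

```python
def annotate_lowest_marked(children, marked, root=0):
    """Return lma[v]: the deepest marked node on the root-to-v path (v included), or None."""
    lma = [None] * len(children)
    stack = [(root, None)]  # (node, nearest marked proper ancestor)
    while stack:
        node, inherited = stack.pop()
        # A marked node points to itself; otherwise it inherits its parent's pointer.
        lma[node] = node if marked[node] else inherited
        for child in children[node]:
            stack.append((child, lma[node]))
    return lma


if __name__ == "__main__":
    # Node 0 is the root; nodes 1 and 3 are marked.
    children = [[1, 4], [2, 3], [], [], [5], []]
    marked = [False, True, False, True, False, False]
    print(annotate_lowest_marked(children, marked))
```

Each node is pushed and popped once, so the traversal is linear in the size of the tree, and a later lookup is a single array access.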

The suffix tree is a versatile tool in string algorithms, and is already needed in many applications to
facilitate other queries. Thus, in practice, our linear-time dictionary matching program with the uncompressed
suffix tree requires very little additional space. This tool is itself a contribution, allowing efficient dictionary
matching in small space. However, we improved this application by using a compressed suffix tree as the
underlying data structure.

4 Dictionary Matching with Compressed Suffix Tree 

In this section we describe how we redesigned our dictionary matching code to run over a compressed suffix
tree in linear time, disregarding the slowdown of queries on the compressed suffix tree. Since the existing
compressed suffix tree construction algorithms are not online algorithms, it is not possible to build the
compressed suffix tree incrementally, inserting one pattern at a time. Instead, the dictionary is merged into a
single string by concatenating the patterns with a unique delimiter between them. We used the Succinct Data
Structures Library (SDSL)¹ since it provides a C++ implementation of a variety of compressed suffix tree
representations and it was shown to be more efficient than previous compressed suffix tree implementations
[10].

Although the ultimate capability of the compressed suffix tree is modeled after the functionality of its
uncompressed counterpart, many operations that are straightforward in the uncompressed suffix tree require
creativity in the compressed data structures. Understanding how the suffix tree components are represented
in the compressed variation is a necessary prerequisite to implementing seemingly straightforward
navigational tasks. Furthermore, the compressed suffix tree is a self-index and allows us to discard the original
set of patterns. Thus, we had to determine which component data structure to query in order to randomly
access a single pattern character. For instance, announcing a pattern occurrence (Algorithm 2, line 25) is
not simply a question of checking whether traversal has reached the end of a leaf representing the first suffix
of a pattern. A simple if statement is replaced by the segment of pseudocode delineated in Algorithm 5 and
described in the following paragraph.

¹ http://simongog.github.com/sdsl/

Algorithm 2 Dictionary matching over the generalized suffix tree

 1: curNode ← root
 2: textIndex ← 0
 3: curNodeIndex ← 0
 4: skipcount ← 0
 5: usedSkipCount ← false
 6: repeat
 7:     lastNode ← curNode
 8:     if usedSkipCount ≠ true then
 9:         textIndex += curNodeIndex
10:         curNodeIndex ← 0
11:         curNode ← curNode.child(text[textIndex])
12:         if curNode.length > 0 then
13:             curNodeIndex++ {already compared the first character on the edge}
14:         end if
15:     else
16:         usedSkipCount ← false
17:     end if
18:     {compare text}
19:     while curNodeIndex < curNode.length AND curNodeIndex + textIndex < text.length do
20:         if text[textIndex + curNodeIndex] ≠ pat[curNode.stringNum][curNode.beg + curNodeIndex] then
21:             break {mismatch}
22:         end if
23:         curNodeIndex++
24:     end while
25:     if curNodeIndex = curNode.length AND curNode.firstLeaf() then
26:         announce pattern occurrence
27:     end if
28:     if curNodeIndex = curNode.length AND curNode.length > 0 AND text[textIndex + curNodeIndex - 1] = pat[curNode.stringNum][curNode.beg + curNodeIndex - 1] then
29:         continue {branch and continue comparing text to patterns}
30:     end if
31:     handleMismatch
32: until textIndex + curNodeIndex ≥ text.length {scan entire text}

Algorithm 3 Handling a Mismatch

if curNode.depth ≠ 0 OR lastNode.depth ≠ 0 then
    if curNode.suffixLink = root AND lastNode.suffixLink ≠ root then
        curNode ← lastNode
        curNodeIndex ← curNode.length {mismatched when trying to branch}
        textIndex -= curNode.length
    end if
    if curNode.parent = root AND curNodeIndex = 1 then
        textIndex++
        curNodeIndex = 0
        curNode = curNode.parent
        continue {when traverse suffix link: will be at mismatch, so skip 1 char}
    end if
    useSkipcountTrick(skipcount, curNode)
else
    {mismatch at root}
    textIndex++
end if

Algorithm 4 Skip-Count trick

repeat
    curNode ← curNode.suffixLink
    usedSkipCount ← true
    textPos = curNodeIndex + textIndex
    skipcount ← curNode.length - curNodeIndex
    if skipcount ≥ curNode.length then
        if curNode.length = 0 then
            usedSkipCount ← false {branch at next iteration of outer loop, look for next text char}
            curNodeIndex ← 0
            skipcount ← 0
        else
            if skipcount = curNode.length then
                curNodeIndex ← 0
                usedSkipCount ← false {branch at next iteration of outer loop}
            end if
            skipcount -= curNode.length
            curNode ← curNode.parent
        end if
    else
        curNodeIndex ← curNode.length - skipcount
        skipcount ← 0
    end if
until skipcount ≤ 0
textIndex = textPos - curNodeIndex

Algorithm 5 Announcing Pattern Occurrence in CST

if getCharAtNodePos(curNode, curNodeIndex) = END_OF_STRING_MARKER then
    pos ← csa[lb(curNode)] - 1
    {lb(v) returns the left bound of node v in the suffix array}
    {pos is dictionary index immediately preceding this leaf's ancestor emanating from root}
    if pos < 0 then
        occ ← true {beginning of first pattern}
    else
        c ← getCharAtPatternPos(pos)
        if c = END_OF_STRING_MARKER then
            occ ← true {beginning of some pattern after first}
        end if
    end if
end if

Instead of an if statement that checks properties of a leaf, we perform the following computation,
involving several function calls, to determine if a pattern occurrence has been located in the text. When traversing
the compressed suffix tree according to the text, a mismatch along an edge leading into a leaf may in fact 
be a pattern occurrence. Thus, we first check if the mismatch is a string delimiter, which mismatches every 
text character. Then, we determine if this leaf represents the first suffix of some pattern. This is done by 
finding out which character precedes the beginning of this leaf's path from the root. If the path begins at the 
beginning of the dictionary, this leaf represents the first suffix of the first pattern, and a pattern occurrence is 
announced. Similarly, if the character at that position is a pattern delimiter, the suffix is a complete pattern, 
and a pattern occurrence is announced. 
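
The check just described can be sketched on the plain concatenated dictionary. This is a simplified illustration under stated assumptions: the string `dictionary` stands in for character access through the self-index, `suffix_start` stands in for the suffix array entry at the leaf's left bound, and the delimiter `$` and the function name are hypothetical.

```python
DELIM = "$"  # assumed delimiter; it terminates every pattern and never occurs inside one


def is_pattern_occurrence(dictionary, suffix_start, matched):
    """True if the `matched` characters starting at `suffix_start` form a complete pattern."""
    end = suffix_start + matched
    # Step 1: the character that "mismatched" the text must be a delimiter.
    if end >= len(dictionary) or dictionary[end] != DELIM:
        return False
    # Step 2: the suffix must begin a pattern, i.e. it starts the dictionary
    # or is immediately preceded by a delimiter.
    pos = suffix_start - 1
    return pos < 0 or dictionary[pos] == DELIM


if __name__ == "__main__":
    dictionary = "a$ate$bath$later$"
    print(is_pattern_occurrence(dictionary, 2, 3))   # "ate" starting at a pattern boundary
    print(is_pattern_occurrence(dictionary, 12, 4))  # "ater" is only a suffix of "later"
```

In the compressed setting each character lookup is itself a query on the index, which is why the test costs more than the single leaf check used with the uncompressed suffix tree.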

The skip-count trick we described in the previous section enables us to navigate the compressed suffix 
tree while processing the text in linear time. When we use this technique and traverse suffix links to find 
pattern occurrences in the text, some pattern occurrences can pass unnoticed. This concern is limited to 
a dictionary in which one pattern is a proper substring of another. Consider the suffix tree in Figure 1
for the dictionary of patterns D = {a, ate, bath, later}. Two of the patterns in the dictionary are
substrings of other patterns. If the text contains the word lately, an occurrence of the pattern ate should 
be identified within this word. However, using suffix links, we navigate from the node labeled later to the 
node labeled ater to the node labeled ter, without recognizing an occurrence of ate. This is because we 
are looking for the longer pattern later. 
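
As a reference point for this example, a brute-force scan shows which occurrences a correct matcher must report on the text lately. The sketch below is not the suffix tree method; it is a quadratic-time check, with a hypothetical function name, that reports the longest pattern ending at each text position.

```python
def longest_match_ending_at(text, patterns):
    """For each text position e, the longest pattern that ends at e (inclusive), or None."""
    by_length = sorted(set(patterns), key=len, reverse=True)
    result = []
    for end in range(len(text)):
        hit = None
        for pattern in by_length:
            start = end - len(pattern) + 1
            if start >= 0 and text[start:end + 1] == pattern:
                hit = pattern  # longest patterns are tried first, so stop at the first hit
                break
        result.append(hit)
    return result


if __name__ == "__main__":
    print(longest_match_ending_at("lately", ["a", "ate", "bath", "later"]))
```

The occurrence of ate ending at the fourth character is exactly the one that a traversal following only suffix links from later would miss.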

In the uncompressed suffix tree, we mark nodes that are pattern occurrences and preprocess the suffix
tree with a depth-first traversal so that lowest marked ancestor (LMA) queries can be answered in constant
time. Then, an LMA query at each traversal of a suffix link ensures that no pattern occurrence is skipped
over by the skip-count trick. In a compressed suffix tree, this is not as straightforward since the nodes are not
stored as independent entities. Thus, we implemented a framework for answering lowest marked ancestor
queries in constant time that consists of bit arrays and sequences of balanced parentheses. We coded this
framework with the compressed suffix tree in mind. Yet, it is suitable for any compressed representation
of an ordered tree that represents the nodes as a sequence of balanced parentheses. This is a more general
contribution of this project.

Dictionary = {a, ate, bath, later}

Figure 1: Suffix tree for a dictionary in which two patterns are proper substrings of other patterns. Two nodes are
marked. A depth-first search is performed on the nodes to set up arrays M and D, depicted above the suffix tree along
with the balanced parentheses representation of the tree structure.

We built a succinct framework that answers LMA queries in constant time by augmenting the
compressed suffix tree with a bit array, M, and a sequence of balanced parentheses, D. M and D are populated
by a depth-first traversal once the compressed suffix tree is fully constructed. The bit array M stores two
bits per suffix tree node. The suffix tree is traversed in depth-first order and a 1 is stored in each bit that
represents a marked node. The sequence of balanced parentheses D denotes the relationship between the
marked nodes in the suffix tree, also in depth-first search order. D is stored as a bit array, with two bits
per marked node, in which 1 stands for '(' and 0 stands for ')'. We use bit array B to refer to the balanced
parentheses representation of the nodes in the compressed suffix tree.
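
The construction of these arrays can be sketched on a small explicit tree. The sketch assumes a tree given as hypothetical `children` lists and `marked` flags, and it marks both the opening and the closing bit of a marked node in M, which is how we read the description above; it uses plain Python lists rather than succinct bit vectors.

```python
def build_bmd(children, marked, root=0):
    """Return (B, M, D) for a rooted tree, with 1 for '(' and 0 for ')'."""
    B, M, D = [], [], []

    def visit(node):
        flag = 1 if marked[node] else 0
        B.append(1)          # first visit: open parenthesis
        M.append(flag)
        if flag:
            D.append(1)
        for child in children[node]:
            visit(child)
        B.append(0)          # last visit: close parenthesis
        M.append(flag)
        if flag:
            D.append(0)

    visit(root)
    return B, M, D


if __name__ == "__main__":
    # Node 0 is the root; nodes 1 and 3 are marked.
    children = [[1, 5], [2, 3, 4], [], [], [], [6], []]
    marked = [False, True, False, True, False, False, False]
    for name, bits in zip("BMD", build_bmd(children, marked)):
        print(name, bits)
```

B and M have the same length and are aligned position by position, while D contains one entry for every 1 in M; this alignment is what lets rank and select translate between M and D.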

The first step in performing an LMA query on a node x in the compressed suffix tree is to find out if x 
is marked in M. If the node is marked, it is its own LMA. If it is not a marked node, we locate the closest 
marked bit to the left of the node in M, which we call y. If y represents the first visit to a node, an open 
parenthesis in B, y represents the lowest marked ancestor of x. Otherwise, the lowest marked ancestor of x 
corresponds to the closest marked bit enclosing y. To find the lowest marked ancestor in this case, we map
y from M to D, find its matching open parenthesis, and then find the open parenthesis of the closest pair
that encloses it in D. This procedure is delineated in Algorithm 6.

We refer to the suffix tree in Figure 1 for illustrative examples. The LMA of the node labeled a is itself
since its bit, M[1], is marked in M. The LMA of the node labeled ater, represented by M[8], is the node
labeled ate, which corresponds to y = 5, since B[5] is an open parenthesis. The LMA of the node labeled
ath, represented by M[11], is the node labeled a, which corresponds to M[1], since y = 10, B[10] is a
close parenthesis, and position 1 is the open parenthesis of the node that encloses M[10] in D.

Algorithm 6 performs constant-time lowest marked ancestor (LMA) queries on the compressed suffix
tree and consists of a set of operations on bit arrays and sequences of balanced parentheses. We use data
structures that answer rank [23] and select [6] queries on bit arrays in constant time and find_open, find_close,
and enclose queries on sequences of balanced parentheses in constant time [22]. A rank query, rank(i),
returns the number of 1's in the first i positions of the array. A select query, select(i), finds the position
of the ith 1 in the bit array. find_open(i) and find_close(i) queries find the matching parenthesis for the
parenthesis at position i. enclose(i) finds the closest enclosing pair of parentheses to the parenthesis at
position i. We used efficient implementations of these data structures that are included in the Succinct Data
Structures Library.

Algorithm 6 Lowest Marked Ancestor Query on node in CST

{returns root if node has no marked ancestor}
{rank and select queries assume that the bit array is 0-based}
if M[node] = 1 then
    return node {node is marked}
else
    pre_y ← M.rank(node + 1)
    if pre_y = 0 then
        return root
    else
        y ← M.select(pre_y) - 1
        if B[y] = 1 then
            return y {y is the LMA since B[y] = '('}
        else
            y1 ← M.rank(y) {corresponding index in D}
            y2 ← D.find_open(y1)
            y3 ← D.enclose(y2)
            if y3 = NULL then
                return root {no enclosing parentheses}
            else
                y4 ← M.select(y3 + 1) - 1 {map from D to M}
                return y4
            end if
        end if
    end if
end if
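
The query can be traced with naive stand-ins for the succinct primitives. In the sketch below, `rank`, `select`, `find_open` and `enclose` are linear-time Python helpers written for this illustration; they are not the SDSL API and do not achieve the constant-time bounds. The conventions are assumptions chosen to be consistent with Algorithm 6: a node is identified by the position of its open parenthesis in B, `rank(i)` counts the 1's among the first i positions, and `select(k)` returns a 1-based position.

```python
def rank(bits, i):
    """Number of 1's among the first i positions bits[0:i]."""
    return sum(bits[:i])


def select(bits, k):
    """1-based position of the k-th 1 (k >= 1)."""
    seen = 0
    for pos, bit in enumerate(bits):
        seen += bit
        if seen == k:
            return pos + 1
    raise ValueError("fewer than k ones")


def find_open(parens, i):
    """Index of the '(' that matches the ')' at index i."""
    depth = 0
    for j in range(i, -1, -1):
        depth += 1 if parens[j] == 0 else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")


def enclose(parens, i):
    """Index of the '(' of the closest pair enclosing the '(' at index i, or None."""
    depth = 0
    for j in range(i - 1, -1, -1):
        if parens[j] == 0:
            depth += 1          # skip over a complete sibling pair
        elif depth == 0:
            return j
        else:
            depth -= 1
    return None


def lowest_marked_ancestor(B, M, D, node, root=0):
    """Follow the steps of Algorithm 6; returns root if no marked ancestor exists."""
    if M[node] == 1:
        return node
    pre_y = rank(M, node + 1)
    if pre_y == 0:
        return root
    y = select(M, pre_y) - 1          # closest marked bit to the left of node
    if B[y] == 1:
        return y                      # an open parenthesis: y encloses node
    y3 = enclose(D, find_open(D, rank(M, y)))
    return root if y3 is None else select(M, y3 + 1) - 1


if __name__ == "__main__":
    # Toy tree "(( () () () ) (()) )" whose second and fourth nodes in DFS order are marked.
    B = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0]
    M = [0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
    D = [1, 1, 0, 0]
    print(lowest_marked_ancestor(B, M, D, 6))
```

Querying position 6 exercises the longest path through the algorithm: the closest marked bit on the left is the close parenthesis of a marked sibling, so the answer is found by moving to D, matching that parenthesis, and taking its enclosing pair.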

5 Experimental Results 

We implemented the algorithms in C++ and ran experiments on computers that feature an Intel(R) Xeon(R)
processor at 2.93 GHz, with 5 GB of RAM, running Linux kernel version 2.6.32. For one set of experiments,
we searched a 5 MB English text for common English words in a 2.5 MB dictionary that comes from
ClueWeb09² and was used in [3]. We performed another set of experiments on biological data. We searched
5 MB of the human genome for patterns in a 3 MB dictionary of promoter sequences in the human genome³.
The texts are 5 MB of DNA and 5 MB of English text from the Pizza&Chili corpus⁴.

We used the framework of Sadakane's compressed suffix tree [21], cst_sada, in our experiments since
it stores nodes as a sequence of balanced parentheses and we were able to augment it for constant-time
lowest marked ancestor queries. Configurations of cst_sada with newer representations of its components
beat the runtime of configurations of the other types of compressed suffix trees on almost all operations
and its space savings are of comparable significance [10]. In particular, the navigational operations are very
fast. Sadakane's CST consists of a compressed suffix array, a compressed LCP array, and a navigational
structure. We ran experiments on four different variations of Sadakane's CST using two different types of
compressed suffix arrays, csa_sada and csa_wt, and two different types of compressed LCP arrays, lcp_dac
and lcp_support_tree2. We also ran our experiments on an uncompressed suffix tree. The csa_sada class
is a very clean reimplementation of Sadakane's compressed suffix array [19] and the csa_wt class is based
on a wavelet tree. The lcp_dac class uses the direct accessible code solution of Brisaboa et al. [4], which
represents the LCP array in suffix array order, and lcp_support_tree2 uses a tree compressed representation
of the LCP array, which is based on the topology of the compressed suffix tree.

We compare the time-space trade-off of dictionary matching using different variations of 1D dictionary
matching software. For a baseline, we use uncompressed components, which consume the most space 
but perform operations in constant time. The remaining runs use different underlying representations of 
compressed suffix arrays and compressed LCP arrays as components. 

Compressed suffix trees conserve a considerable amount of space at the cost of a negligible
slowdown in running time. This is illustrated in Figure 2 and in Tables 2 and 3.
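
The size of the saving can be read off the space figures reported in Tables 2 and 3. The short computation below is our own arithmetic on those reported numbers, with a hypothetical helper name; it divides the uncompressed footprint by each compressed footprint.

```python
# Space in MB, copied from Tables 2 (DNA) and 3 (English text).
SPACE_MB = {
    "DNA": {"uncompressed": 26.7, "sada_dac": 8.8, "sada_st2": 7.7, "wt_dac": 6.5, "wt_st2": 5.4},
    "English": {"uncompressed": 23.1, "sada_dac": 7.8, "sada_st2": 6.6, "wt_dac": 6.2, "wt_st2": 5.0},
}


def reduction_factors(sizes):
    """Ratio of the uncompressed size to each compressed configuration's size."""
    base = sizes["uncompressed"]
    return {name: base / mb for name, mb in sizes.items() if name != "uncompressed"}


if __name__ == "__main__":
    for dataset, sizes in SPACE_MB.items():
        factors = reduction_factors(sizes)
        print(dataset, {name: round(f, 2) for name, f in factors.items()})
```

On both data sets the compressed configurations use roughly one third to one fifth of the space of the uncompressed suffix tree, with wt_st2 the smallest in each case.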

6 Conclusion 

We have introduced dictionary matching software that runs in small space. Its underlying data structure 
is the compressed suffix tree. This program runs in linear time, disregarding the slowdown of querying 
the compressed self-index. We have shown that our implementation conserves considerable space in practice.
Our software includes a space-efficient technique for performing lowest marked ancestor queries on
compressed suffix trees, a contribution that is useful for many other applications. 

We would like to extend our small-space dictionary matching software to accommodate a dynamically
changing set of patterns in the dictionary. Several dynamic compressed suffix tree representations have been
presented [1, 17] but they lack implementations. Extending this work to the dynamic setting would begin
by implementing the dynamic compressed suffix tree to accommodate insertion, deletion, and modification
of dictionary patterns, without rebuilding the index of the entire dictionary.

² http://lemurproject.org/clueweb09/
³ http://epd.vital-it.ch
⁴ http://pizzachili.dcc.uchile.cl

Figure 2: Time-space trade-offs of different representations of the compressed suffix array and the compressed LCP
array in Sadakane's compressed suffix tree for dictionary matching. The uncompressed suffix tree is the outlier in
space consumption. The four compressed versions have similar time and space complexities.

Experiments on DNA: 3 MB dictionary and 5 MB text

CST components | Space | Preprocessing Time | Searching Time
uncompressed | 26.7 MB | 8 sec | 2.05 hours
sada_dac | 8.8 MB | 19 sec | 2.08 hours
sada_st2 | 7.7 MB | 21 sec | 2.07 hours
wt_dac | 6.5 MB | 29 sec | 2.08 hours
wt_st2 | 5.4 MB | 30 sec | 2.09 hours

Table 2: Time-space trade-offs of using different representations of the compressed suffix array and the compressed
LCP array in Sadakane's compressed suffix tree for searching 5 MB of the human genome for promoter sequences
that comprise a 3 MB dictionary. 13 pattern occurrences were found in the text.



Experiments on English text: 2.5 MB dictionary and 5 MB text

CST components | Space | Preprocessing Time | Searching Time
uncompressed | 23.1 MB | 7 sec | 3.63 hours
sada_dac | 7.8 MB | 17 sec | 3.63 hours
sada_st2 | 6.6 MB | 17 sec | 3.64 hours
wt_dac | 6.2 MB | 39 sec | 3.64 hours
wt_st2 | 5.0 MB | 40 sec | 3.63 hours

Table 3: Time-space trade-offs of using different representations of the compressed suffix array and the compressed
LCP array in Sadakane's compressed suffix tree for searching 5 MB of English text for common English words in a
2.5 MB dictionary. 12,717 pattern occurrences were located in the text.